Object Detection Based on Swin Deformable Transformer-BiPAFPN-YOLOX

Object detection technology plays a crucial role in people's everyday lives, as well as in enterprise production and modern national defense. Most current object detection networks, such as YOLOX, employ convolutional neural networks rather than a Transformer as the backbone. However, these techniques lack a global understanding of images and may lose meaningful information, such as the precise location of the most active feature detector. Recently, Transformers with larger receptive fields have shown performance superior to that of corresponding convolutional neural networks in computer vision tasks. The Transformer splits the image into patches and feeds them to the model as a sequence, similar to word embeddings; this makes it capable of globally modeling entire images and thus of a global understanding of them. However, simply using a Transformer with a larger receptive field raises several concerns. For example, the window-based self-attention in the Swin Transformer backbone limits its ability to model long-range relations, resulting in poor feature extraction and slow convergence during training. To address the above problems, first, we propose an important-region-based Reconstructed Deformable Self-Attention that shifts attention to important regions for efficient global modeling. Second, based on the Reconstructed Deformable Self-Attention, we propose the Swin Deformable Transformer backbone, which improves the feature extraction ability and convergence speed. Finally, based on the Swin Deformable Transformer backbone, we propose a novel object detection network, namely, Swin Deformable Transformer-BiPAFPN-YOLOX. Experimental results on the COCO dataset show that the training period is reduced by 55.4%, average precision is increased by 2.4%, average precision on small objects is increased by 3.7%, and inference speed is increased by 35%.


Introduction
Object detection represents one of the major concepts in the field of computer vision. In everyday life, advanced object detection technology can be used for intelligent vehicle environment perception tasks to facilitate travel; in enterprises, it can be used for normal operations in specific scenarios, such as parks and ports; in modern national defense, it contributes to better completing offensive and defensive tasks. You Only Look Once (YOLO) is a representative algorithm in object detection, employing a convolutional neural network (CNN) as the backbone for feature extraction. For example, YOLOv3 [1] and YOLOX [2] use Darknet-53 [1] as their backbone, while YOLOv4 [3] and YOLOv5 [4] use CSPDarknet53 [5]. However, these techniques are translation-invariant, locality-sensitive, and lack a global understanding of images. Furthermore, CNN-based models use pooling layers for dimensionality reduction to reduce the computational cost, which causes the loss of a significant amount of meaningful information, such as the precise location of the most active feature detector.
The Transformer is a model that relies entirely on self-attention for natural language processing. Recently, the Transformer [6] achieved superior performance in language modeling and machine translation tasks. A Transformer-based model has a larger receptive field than a CNN, globally models the entire image, and has a global understanding of images, which has prompted researchers to explore the application of the language Transformer to the visual field. Several studies [7][8][9][10][11][12] modeled the vision task as a dictionary lookup problem with learnable queries and used the Transformer encoder as a task-specific head on top of a CNN backbone. In image classification tasks, one study [13] was the first to propose the Vision Transformer (ViT), directly applying a standard Transformer to images by splitting them into patches instead of focusing on pixels and then inputting the patches to the Transformer encoder. As described in research [14], the image patches are treated as tokens in natural language processing applications. This leads to highly competitive results on the ImageNet dataset.
In existing visual Transformer research, ViT [13] is suitable for image classification tasks and is an interesting and meaningful attempt to replace the CNN backbone with a convolution-free model. Although ViT [13] is applicable to image classification, its direct adaptation to pixel-level dense predictions such as object detection is challenging, because (1) its output feature map is single-scale and low-resolution, and (2) the excessive number of keys to attend to per query patch yields a high computational cost and also increases the risk of overfitting.
To avoid excessive attention computation, existing studies [15][16][17][18][19] leveraged carefully designed efficient attention patterns to reduce the computational complexity. Studies [17, 20] downsample the key and value feature maps to save on computation costs. Although this method is effective, it is likely that relevant keys and values are dropped while less important ones are kept, which causes important features to be lost. The attention mechanism in one study [21] adaptively localized the object regions; however, its random reference points weakened the correlation between features, resulting in feature loss. The Swin Transformer [22, 23] adopts shifted window-based attention to restrict attention to local windows. Despite their effectiveness, the hand-crafted attention patterns are data-agnostic, which limits their self-attention modeling ability, reduces feature extraction ability, and is suboptimal. Moreover, self-attention requires a long training period to adaptively learn to focus on object regions in the images, which results in slow convergence.
Notably, deformable convolution [24, 25] can learn deformable receptive fields and attend to flexible spatial locations conditioned on the input data. This was shown to be effective in selectively attending to more informative regions on a data-dependent basis. However, its deformable offsets make the computational cost quadratic with the image size. Inspired by deformable convolution [24, 25], Deformable DETR [8] improves the convergence and decreases the computational complexity of DETR [7] by selecting a small number of keys for each query. However, its attention is not suited to a visual backbone for feature extraction, because the lack of keys restricts representational power.
To solve the above problems, we pursue the design of a novel attention mechanism that adaptively focuses on important object regions and ignores unimportant features, to improve the modeling ability and speed of the attention mechanism. With this attention mechanism, our Transformer has powerful feature extraction capabilities for important object regions and fast convergence during training for object detection tasks. Along with the above ideas, we propose the Reconstructed Deformable Self-Attention based on important regions, which shifts attention to important regions to capture more informative features for global modeling. Specifically, first, we use the grid points composed of patches as patch grid points; then, we use the query as the input of the offset network to generate the offsets corresponding to all patch grid points; finally, the patch grid points combine the offsets and transfer the keys/values to the important regions. In this manner, the Reconstructed Deformable Self-Attention, which depends on the data pattern, can focus on more important and relevant object regions to capture a larger number of features more efficiently. This improves the modeling ability and effectively shortens the modeling time. Based on the Reconstructed Deformable Self-Attention, we propose the Swin Deformable Transformer backbone. Compared with the Swin Transformer, the Swin Deformable Transformer can retain more important keys and values and ignore irrelevant ones, resulting in higher flexibility and efficiency. This significantly reduces the training time while maintaining efficient feature extraction capabilities.
The role of the neck in YOLOX is to integrate features of different stages and scales, such that YOLOX has the ability to represent multiscale features. YOLOX's neck is PAFPN [26], which fuses different input features by summation. Different input features have different resolutions, and their contribution to the fused output features is not uniform; however, in the actual calculation, the weights of all input features are the same, which lessens the contribution of important features. To address this issue, we use BiPAFPN [27] as the neck of YOLOX. BiPAFPN learns the importance of different input features by introducing learnable weights, thereby enhancing the contribution of important features.
In summary, our contributions are as follows: (1) We propose a novel Reconstructed Deformable Self-Attention based on important regions, which shifts attention to important object regions in the image, ignores unimportant features, and then performs efficient visual modeling of target features. (2) Based on the Reconstructed Deformable Self-Attention, we propose a novel backbone, the Swin Deformable Transformer, which speeds up convergence while improving the feature extraction ability. (3) We design a novel Transformer-based object detection network, Swin Deformable Transformer-BiPAFPN-YOLOX, which uses the Swin Deformable Transformer as the backbone of YOLOX and BiPAFPN as its neck. This study is the first to propose a Transformer as the backbone of YOLO and obtains performance superior to that of the CNN.
The unified framework of CV and NLP was shown to promote the common development of the two fields. The leap from the language Transformer to the visual Transformer promotes the joint modeling of visual and textual information. The extensive application of the Transformer in various fields implies that it may realize multimodal co-learning and unified modeling in the future.

Related Work
CNN is a standard network in the field of computer vision. Since the introduction of AlexNet [28], CNNs have become mainstream. A previous study [29] proposed a CNN-based detection network used for behavior analysis and violence detection in the industrial Internet of Things. Another study [30] proposed a 101-layer backbone, ResNeXt-101, for feature extraction. Yet another study [31] proposed a lightweight backbone, MobileNet, for feature extraction, while reference [32] proposed a depthwise separable convolution [33] based network, which produces real-time high-quality density maps and effectively counts people in extremely overcrowded scenes. However, these CNN-based networks lose meaningful information when employing the dimensionality reduction mechanism. Furthermore, this approach requires powerful GPUs and a large amount of data for effective training. Recent studies found that Transformer-based networks have a larger receptive field and stronger modeling ability than CNNs and can solve the above problems of CNNs.

Transformer.
The pioneering work on visual Transformers, ViT [13], first proposes to apply the Transformer architecture to nonoverlapping image patches and achieves an impressive speed-accuracy trade-off on image classification compared to CNNs. Meanwhile, ViT [13] requires large-scale training datasets (i.e., JFT-300M) and a large number of training epochs to perform well. DeiT [34] introduces several training strategies that allow ViT to be effective using the smaller ImageNet-1K [35] dataset. The results of ViT on image classification are encouraging; however, its architecture is not suitable for use as a general-purpose backbone network on dense vision tasks or when the input image resolution is high, due to its low-resolution feature maps and the quadratic increase in complexity with image size. After the introduction of ViT [13], improvements focused on learning multiscale features for dense prediction tasks. CvT [36] adopts convolution in the tokenization process and utilizes strided convolution to reduce the computational complexity of self-attention. SepViT [37] uses novel window token embedding and grouped self-attention to model the attention relationship among windows with negligible computational cost. PVTv2 [17] reduces the computational complexity of PVTv1 [20] to linear by adding linear-complexity attention layers, overlapping patch embedding, and a convolutional feed-forward network.
Subsequently, focus has shifted to efficient attention mechanisms for multiple computer vision downstream tasks. These attention mechanisms include global tokens [15, 38, 39], focal attention [18], and dynamic token sizes [40]. Reference [41] sequentially proposes spatially separable self-attention and cross-shaped window self-attention based on the hierarchical architecture. Deformable convolution [24, 25] is a powerful mechanism to attend to sparse spatial locations conditioned on the input data. Inspired by deformable convolution, Deformable DETR [8] combines the advantages of the sparse spatial sampling of deformable convolution and the relation modeling capability of the Transformer, effectively shortening the training period. However, its attention is not suited to a visual backbone for feature extraction, as the attention in Deformable DETR comes from simply learned linear projections, and keys are not shared among query tokens. CSWin Transformer [16] and Swin Transformer [22, 23] adopt windowed attention and show improvements on downstream tasks. However, these attentions limit the ability to model long-range relations, resulting in poor feature extraction and slow convergence. Although attention mechanisms and attention-based Transformer backbones are constantly being improved, they nevertheless fail to fully improve the ability of attention to model target features, as well as the feature extraction ability and convergence speed of the Transformer backbone. Based on this research, we propose our network to overcome the outstanding challenges.

Neck.
Multiscale feature aggregation represents features of different resolutions more effectively. PANet [26] adds a bottom-up path aggregation network on top of FPN [42]. This solves the problem of restricted information flow in FPN [42]; however, important input features in PANet [26] are weakened. NAS-FPN [43] leverages neural architecture search to automatically design the feature network topology. Despite achieving better performance, NAS-FPN [43] requires thousands of GPU hours during the search. BiPAFPN [27] introduces learnable weights to learn the importance of different input features, while repeatedly applying top-down and bottom-up multiscale feature fusion. It strengthens the contribution of important features and performs feature fusion more efficiently.

Head.
The detection head is responsible for predicting bounding boxes and object classes. The two-stage detector proposed in reference [44] achieves high precision but low real-time performance. The one-stage detector proposed in reference [45] has high real-time performance but low precision and performs poorly on small-scale objects. YOLOX [2] proposed an object detection network based on an anchor-free design, a decoupled head, and the label assignment strategy SimOTA, which avoids over-reliance on techniques such as anchor clustering [46] and grid sensitivity [47], significantly simplifying the detector and yielding its advanced performance.

Method
3.1. Overall Architecture. We propose the Reconstructed Deformable Self-Attention as the attention module of the Swin Deformable Transformer and, based on the Reconstructed Deformable Self-Attention, we propose the Swin Deformable Transformer. We use the Swin Deformable Transformer as the backbone, BiPAFPN as the neck, and YOLOX as the detection head. The entire object detection network architecture is shown in Figure 1. First, the images are passed to the four-stage backbone Swin Deformable Transformer to complete multiscale feature extraction; then, the features of different scales are transmitted to the neck BiPAFPN for multiscale feature aggregation to achieve a comprehensive understanding of the images; finally, the features are transmitted to the YOLOX detection head to predict the class, location, and bounding box confidence of objects.
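As a rough illustration of this three-part pipeline, the following sketch wires a backbone, neck, and head together in PyTorch; the class and argument names are placeholders introduced here for illustration, not the paper's actual implementation.

```python
import torch.nn as nn

class SwinDeformableDetector(nn.Module):
    """Illustrative wrapper: backbone -> neck -> head, as described above."""
    def __init__(self, backbone, neck, head):
        super().__init__()
        self.backbone = backbone   # Swin Deformable Transformer (four stages)
        self.neck = neck           # BiPAFPN multiscale feature aggregation
        self.head = head           # YOLOX decoupled detection head

    def forward(self, images):
        multi_scale_feats = self.backbone(images)   # features from the four stages
        fused_feats = self.neck(multi_scale_feats)  # multiscale feature fusion
        return self.head(fused_feats)               # class, location, and confidence predictions
```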

Reconstructed Deformable Self-Attention.
The Reconstructed Deformable Self-Attention based on important regions in the Swin Deformable Transformer is shown in Figure 2. It models the relations between patches under the guidance of important regions in the feature map. These important regions are determined by shifted sampling points. Due to the shifted sampling points, these regions are assigned more locally intensive attention than other regions, which improves the modeling ability and captures important features more accurately.
First, the input features are used to generate patch grid points.

Patch Grid Points.
Patch grid points refer to the four grid points on each patch; the important features in each patch are located through the bounding box composed of these four points to ensure that they are not lost. On the one hand, this prevents information loss over the entire feature map and reduces the computational and time cost of remeshing. On the other hand, because the patches are located within the shifted windows, the shifted windows can interact with each other, which enhances the feature-to-feature correlation. The values of the patch grid points are linearly spaced 2D coordinates (0, 0), ..., (P_H − 1, P_W − 1), where P_H and P_W are the numbers of patches in the height and width directions, respectively. We normalize them to the range [−1, +1] according to the grid shape P_H × P_W, where (−1, −1) pinpoints the top-left corner and (+1, +1) the bottom-right corner. Second, to obtain the offsets of the patch grid points, we project the feature through the projection matrix W_query to obtain the query q, as in equation (1), and then input q to the offset network θ_offset(·) to generate the offsets Δo, as in equation (2). Under the guidance of important regions in the feature map, the patch grid points combined with the offsets become shifted sampling points and migrate to the important regions.
Third, the feature maps are sampled at the positions of the shifted sampling points to obtain the sampled features S′, as in equation (3). Then, S′ is projected by the projection matrices W_key and W_value to obtain the shifted keys and shifted values, respectively, as in equation (4).
where W_query, W_key, and W_value are the projection matrices, and k′ and v′ are the embeddings of the shifted keys and shifted values, respectively. For equation (3), specifically, we set the sampling function I(·; ·) to a bilinear interpolation to render it differentiable, as in equation (5):

I((p_x, p_y); z) = Σ_{(r_x, r_y)} g(p_x, r_x) g(p_y, r_y) z[r_y, r_x, :],

where g(a, b) = max(0, 1 − |a − b|) is the bilinear interpolation kernel, (r_x, r_y) indexes all the locations on the feature map z ∈ R^{H×W×C}, and (p_x, p_y) are the coordinates of the patch grid points.
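To make the flow of equations (1)-(5) concrete, the following is a minimal PyTorch sketch of how the normalized patch grid points, the query-conditioned offsets, and the bilinear sampling could be implemented; the module and function names (grid_points, OffsetNet, sample_shifted_features) are our own illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grid_points(P_H, P_W):
    """Patch grid points: linearly spaced 2D coordinates normalized to [-1, +1]."""
    ys = torch.linspace(-1.0, 1.0, P_H)
    xs = torch.linspace(-1.0, 1.0, P_W)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([gx, gy], dim=-1)                    # (P_H, P_W, 2), (x, y) order

class OffsetNet(nn.Module):
    """theta_offset: predicts a 2D offset for every patch grid point from the queries."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # depthwise local context
            nn.GELU(),
            nn.Conv2d(dim, 2, 1),                           # two offset channels (dx, dy)
        )

    def forward(self, q):                                   # q: (B, C, P_H, P_W) queries
        return self.net(q).permute(0, 2, 3, 1)              # (B, P_H, P_W, 2) offsets

def sample_shifted_features(z, grid, offsets):
    """Equation (5): differentiable bilinear sampling of z at the shifted sampling points."""
    pos = (grid.unsqueeze(0) + offsets).clamp(-1.0, 1.0)    # shifted sampling points in [-1, 1]
    return F.grid_sample(z, pos, mode="bilinear",
                         align_corners=True)                # (B, C, P_H, P_W) sampled features
```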
Fourth, we take the relative position offsets R and compute the attention of q, k, and v. The output of a single attention head is shown in equation (6), where m represents the attention head index, d is the dimension of each attention head, B ∈ R^{(2H−1)×(2W−1)} is the relative position bias table [22], from which we index the relative position bias, and I(B; R) ∈ R^{HW×P_H P_W} is the interpolation of B at the offsets R. We concatenate all attention heads and use W_o to project them to obtain the multihead attention, as in equation (7).
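Continuing the sketch, a single attention head over the shifted keys and values (equation (6)) might look as follows; the interpolated relative position bias term I(B; R) is omitted for brevity, W_q, W_k, and W_v stand for the projection matrices, and this is an illustrative sketch rather than the authors' implementation.

```python
import torch

def single_head_deformable_attention(x, sampled, W_q, W_k, W_v, d):
    """x: (B, C, H, W) input features; sampled: (B, C, P_H, P_W) features at shifted points."""
    B, C, H, W = x.shape
    q = W_q(x.flatten(2).transpose(1, 2))        # (B, H*W, C) queries
    s = sampled.flatten(2).transpose(1, 2)       # (B, P_H*P_W, C)
    k, v = W_k(s), W_v(s)                        # shifted keys and shifted values
    attn = (q @ k.transpose(1, 2)) / d ** 0.5    # (B, H*W, P_H*P_W) similarity scores
    attn = attn.softmax(dim=-1)                  # attention over the sampled points
    return attn @ v                              # (B, H*W, C) single-head output
```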

Point Box Offset.
Since the Reconstructed Deformable Self-Attention extracts image features around the patch grid points, we use the detection head to predict the point box offsets between the center of the bounding box and the patch grid points, as shown in Figure 3. The implementation is as follows: we take the patch grid points as the initial inference points for the center point of the bounding box. Then, we let the detection head predict the offsets of the patch grid points relative to the actual center point of the bounding box, as shown in equation (8).
where p is the patch grid point and b = {b_x, b_y, b_w, b_h} is the offset predicted by the detection head. σ and σ^{−1} denote the sigmoid and inverse sigmoid functions, respectively; they ensure that b consists of normalized coordinates, i.e., b ∈ [0, 1]. In this way, the learned Reconstructed Deformable Self-Attention has a strong correlation with the predicted bounding box, which improves the detection precision.
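A minimal sketch of the decoding in equation (8), assuming the patch grid point p has been rescaled to [0, 1] and delta is the raw point box offset predicted by the head:

```python
import torch

def inverse_sigmoid(x, eps=1e-5):
    """Numerically safe inverse of the sigmoid function."""
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

def decode_box_center(p, delta):
    """p: patch grid points in [0, 1]; delta: predicted point box offsets."""
    return torch.sigmoid(inverse_sigmoid(p) + delta)   # normalized box center in [0, 1]
```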

Backbone: Swin Deformable Transformer.
The backbone Swin Deformable Transformer, based on the Reconstructed Deformable Self-Attention, contains four stages; its architecture is shown in Figure 4. First, the output embedding patches of the linear embedding in Figure 4, together with the position encoding, are passed to W-MRDSA. We use deformable relative position embedding as the position encoding: on the one hand, it can cover all offsets; on the other hand, it encodes the specific position of the input sequence to prevent the embedding patches from being out of order. Second, W-MRDSA is calculated by equation (9). The input of W-MRDSA is a series of key, query, and value vectors. We use the scaled cosine attention method to calculate the similarity between queries and keys and obtain the weight A_i corresponding to each key. We use the SoftMax function to normalize the weights to obtain the weight coefficients C_i. We perform a weighted summation over the values to obtain the self-attention output SA. We project the concatenated outputs of all self-attention heads to obtain the output of W-MRDSA.
In equation (9), B_ij is the relative position bias between pixel i and pixel j; τ is a learnable scalar with a value greater than 0.01; W_o is the projection matrix. The scaled cosine attention method stabilizes the training process and improves the precision. Third, the features are passed to the normalization layer (LN). The data must be normalized before training the neural network, which speeds up the training and improves the stability of the training process. We employ the res-post-norm method, applying LN to the output of each residual branch, which avoids excessive activation amplitudes between layers, stabilizing the training process and improving precision. Fourth, the features are passed into the multilayer perceptron (MLP), which applies a nonlinear transformation to the feature dimension. The features are then passed into the next LN. There are two residual connections in each block in Figure 5, used to prevent the network degradation during training that increases the training error. Fifth, the features are transferred to block l + 1, and subsequently, SW-MRDSA is calculated. The subsequent processes are the same as those of block l. The calculation of two consecutive Swin Deformable Transformer Blocks is shown in equation (10).
where ẑ^l and z^l are the output features of the (S)W-MRDSA and MLP modules in block l, respectively. The computational cost of MRDSA is calculated by equation (11).
where HW is the product of the height and width of each image, C is the feature dimension, P_H and P_W are the numbers of patches in the height and width directions, and k is the number of sampled keys. The total computational cost is linear in the image size. Thus, our proposed Swin Deformable Transformer is suitable for dense prediction vision tasks that require high-resolution input images.
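The scaled cosine attention of equation (9) could be sketched as follows, with a learnable per-head scale tau clamped above 0.01 as stated and rel_pos_bias standing in for the B_ij term; this is an illustrative implementation, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledCosineAttention(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)                        # W_o projection
        self.tau = nn.Parameter(torch.full((num_heads,), 0.1)) # learnable scale per head

    def forward(self, x, rel_pos_bias):                        # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                   # each (B, heads, N, d)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)  # cosine similarity
        tau = self.tau.clamp(min=0.01).view(1, -1, 1, 1)
        attn = (q @ k.transpose(-2, -1)) / tau + rel_pos_bias  # scaled cosine scores + B_ij
        attn = attn.softmax(dim=-1)                            # normalized weight coefficients
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)      # weighted sum over values
        return self.proj(out)
```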
The Swin Deformable Transformer divides the image into windows and calculates attention within them. Simultaneously, it establishes connections based on shifted windows, which enables interaction between the windows of each layer. As a result, the self-attention based on shifted windows has a global modeling capability and can capture more information, as shown in Figure 6.
Layer I uses a regular window partition strategy, where the image is divided into four windows and each window has 4 × 4 patches. After the regular window partition of Layer I is cyclically shifted toward the upper left, the new windows of Layer II are generated. Self-attention in the new windows crosses the boundaries of the regularly partitioned ones, making the windows interact with each other, and thus, the global modeling ability is obtained. For example, window 5 enables window 1 to interact with window 3, and window 9 enables window 1, window 2, window 3, and window 4 to interact with each other.
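The cyclic shift that produces the Layer II windows from the Layer I partition can be sketched with torch.roll; the window and shift sizes here are illustrative.

```python
import torch

def shifted_window_partition(x, window, shift):
    """x: (B, H, W, C); roll the map toward the upper left, then partition into windows."""
    shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    B, H, W, C = shifted.shape
    wins = shifted.view(B, H // window, window, W // window, window, C)
    # (num_windows*B, window, window, C): each new window straddles the old boundaries
    return wins.permute(0, 1, 3, 2, 4, 5).reshape(-1, window, window, C)
```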

Patch Merging.
In stage 2, we build the patch merging mechanism to generate multiscale features.
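The text does not detail the merging operator itself, so the following is a sketch of the standard Swin-style patch merging we assume here: 2 x 2 neighboring patches are concatenated along the channel dimension and reduced by a linear layer, halving the spatial resolution while doubling the channel width.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                                   # x: (B, H, W, C)
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]],
                      dim=-1)                               # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))                 # (B, H/2, W/2, 2C)
```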

Neck: BiPAFPN.
BiPAFPN [27] is a lightweight and efficient network that enables simple and fast multiscale feature fusion. The structure of BiPAFPN is shown in Figure 8 in comparison with PAFPN [26], and its main features are as follows.
First, BiPAFPN removes nodes that have only one input edge. For example, it removes the one-way node between A and B in Figure 8. This is because a node with only one input edge and no feature fusion contributes less to a feature network that aims at fusing different features. This forms a bidirectional information transfer network of upsampling and downsampling on top of PAFPN. Second, BiPAFPN adds residual connections from the original input to the output node if they are at the same level, in order to fuse more features. Third, it treats each bidirectional (top-down and bottom-up) path as one feature layer and repeats the same layer multiple times to enable further high-level feature fusion.
Different input features have different resolutions, and they usually contribute to the output feature unequally. Thus, we add an additional weight for each input and let the network learn the importance of each input feature. Finally, we adopt a fast normalized fusion strategy, as shown in equation (12), where w_i is a learnable weight and ϵ = 0.0001 avoids numerical instability. After the P_6 level feature fusion in Figure 8, the output is given by equations (13) and (14), where P_6^td is the intermediate feature of the top-down path at level P_6, P_6^out is the output feature of the bottom-up path at level P_6, Resize() denotes the upsampling or downsampling that adjusts features of different resolutions to the same resolution, and DSConv is the depthwise separable convolution.
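The fast normalized fusion of equation (12) can be sketched as a small module with ReLU-clipped learnable weights; the P_6 fusion of equations (13) and (14) is obtained by feeding this primitive the resized neighboring levels and a depthwise separable convolution. Names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastNormalizedFusion(nn.Module):
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))   # one learnable weight per input
        self.eps = eps

    def forward(self, inputs):                           # list of same-resolution feature maps
        w = F.relu(self.w)                               # keep the weights non-negative
        w = w / (w.sum() + self.eps)                     # fast normalized fusion weights
        return sum(wi * x for wi, x in zip(w, inputs))   # weighted sum of the inputs
```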

Head: YOLOX.
YOLOX [2] uses the anchor-free mechanism, decoupled head, and label assignment strategy SimOTA, which simplifies the training and decoding process of the detector while improving the detection precision. Its structure is illustrated in Figure 9.
YOLOX adopts the decoupled head [48] to decouple the classification and regression tasks into two branches, and it adds an IoU branch to the regression branch. The classification branch predicts the object categories, the regression branch predicts the coordinates of the center point of the bounding box, and the IoU branch predicts the confidence. The implementation process is as follows: first, the channel dimension of the features is reduced through a 1 × 1 convolution, after which the features are passed to two parallel branches, each of which has two 3 × 3 convolutional layers. Subsequently, they are passed to the classification, regression, and IoU branches, respectively. Finally, we combine the results of these three branches to obtain the training or inference result. The decoupled head not only speeds up convergence but also improves the detection precision of YOLOX.
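A sketch of the decoupled head described above: a 1 x 1 convolution reduces the channels, two parallel stems of two 3 x 3 convolutions feed the classification branch and the regression/IoU branches, and the channel counts and activation are illustrative choices rather than the exact YOLOX configuration.

```python
import torch.nn as nn

class DecoupledHead(nn.Module):
    def __init__(self, in_ch, feat_ch, num_classes):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, feat_ch, 1)              # 1x1 channel reduction
        def branch():
            return nn.Sequential(
                nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.SiLU(),
                nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.SiLU())
        self.cls_branch, self.reg_branch = branch(), branch()
        self.cls_pred = nn.Conv2d(feat_ch, num_classes, 1)    # object categories
        self.reg_pred = nn.Conv2d(feat_ch, 4, 1)              # box center offsets, width, height
        self.iou_pred = nn.Conv2d(feat_ch, 1, 1)              # confidence (IoU) branch

    def forward(self, x):
        x = self.stem(x)
        c, r = self.cls_branch(x), self.reg_branch(x)
        return self.cls_pred(c), self.reg_pred(r), self.iou_pred(r)
```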

The process of switching YOLO to an anchor-free manner is as follows: YOLOX reduces the predictions for each location from three to one and makes the detection head directly predict four values, i.e., two offsets with respect to the top-left corner of the grid cell and the height and width of the predicted box. It assigns each object a positive sample area centered on the object and predefines a scale range, where all samples within this area are positive samples. This alleviates the extreme imbalance of positive and negative samples during training. The anchor-free mechanism improves the precision of object detection and significantly simplifies the detector.
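The anchor-free decoding can be sketched as follows: for each grid cell the head predicts two offsets of the box center from the cell and the width and height of the box, and the stride converts grid units to pixels; the square-grid assumption is only for brevity.

```python
import torch

def decode_anchor_free(pred, stride):
    """pred: (N, 4) raw head outputs [dx, dy, dw, dh]; returns (N, 4) boxes (cx, cy, w, h)."""
    side = int(pred.shape[0] ** 0.5)                       # assume a square feature grid
    ys, xs = torch.meshgrid(torch.arange(side), torch.arange(side), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float()
    cxcy = (grid + pred[:, :2]) * stride                   # box center in pixels
    wh = pred[:, 2:].exp() * stride                        # box width and height in pixels
    return torch.cat([cxcy, wh], dim=1)
```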
YOLOX proposes a dynamic top-k label assignment strategy, SimOTA, which optimizes the matching between predictions (p) and ground truths (g). The SimOTA implementation process is as follows: first, SimOTA uses get_assignments to obtain the ground truth labels. SimOTA then calculates the pairwise matching degree, represented by a cost for each p-g pair. For example, the cost between p_j and g_i combines L^cls_ij, L^reg_ij, L^iou_ij, and L^L1_ij, which are the classification loss, regression loss, confidence loss, and L1 norm loss between the prediction and the ground truth, respectively, with λ as the balance coefficient. Then, we select the k predictions with the smallest cost within the fixed central region as positive samples.
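One plausible sketch of this assignment step, assuming precomputed per-pair loss matrices of shape (num_gt, num_pred), a boolean center-prior mask, and a simple additive cost with balance coefficient lambda_; the exact cost combination in the paper may differ.

```python
import torch

def simota_like_assign(cls_loss, reg_loss, iou_loss, l1_loss, center_mask, k, lambda_=3.0):
    """Select the k lowest-cost predictions inside the central region for each ground truth."""
    cost = cls_loss + lambda_ * reg_loss + iou_loss + l1_loss   # (num_gt, num_pred) cost matrix
    cost = cost + 1e5 * (~center_mask)                          # exclude predictions outside the region
    _, pos_idx = torch.topk(cost, k, dim=1, largest=False)      # dynamic top-k positive samples
    return pos_idx                                              # (num_gt, k) indices
```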

Evaluation
Metrics. AP represents the average precision, AP_S represents the AP of small-scale objects with area < 32², AP_M represents the AP of medium-scale objects with 32² < area < 96², and AP_L represents the AP of large-scale objects with area > 96². Epoch represents the training period. To calculate AP_50 and AP_75, the intersection over union (IoU) is used to set the thresholds between the ground truth and prediction boxes. The formula for IoU can be expressed as IoU = |B_p ∩ B_gt| / |B_p ∪ B_gt|, where B_p is the predicted box and B_gt is the ground truth box. AP_50 and AP_75 denote the AP at IoU thresholds of 0.50 and 0.75, respectively.
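A minimal IoU computation for axis-aligned boxes in (x1, y1, x2, y2) format, matching the definition used for the AP_50 and AP_75 thresholds:

```python
def iou(box_p, box_gt):
    """Intersection over union of a predicted box and a ground truth box."""
    ix1, iy1 = max(box_p[0], box_gt[0]), max(box_p[1], box_gt[1])
    ix2, iy2 = min(box_p[2], box_gt[2]), min(box_p[3], box_gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    return inter / (area_p + area_gt - inter + 1e-9)
```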

Implementation Details.
We set up a two-stage training strategy. We train the backbone Swin Deformable Transformer in the first stage. First, we pretrain the Swin Deformable Transformer backbone on the ImageNet-1K dataset [35] for initialization and then train it on the COCO training set. We use the AdamW optimizer [50] and soft-NMS [51]. The initial learning rate is set to 3 × 10⁻⁴, the weight decay is 0.05, the batch size is 16, and a 5× schedule is used to train for a total of 60 epochs. The learning rate decays by a factor of 0.1 at 40 and 52 epochs. We train the overall network Swin Deformable Transformer-BiPAFPN-YOLOX in the second stage. We first conduct a 6-epoch warmup on the COCO training set. We use stochastic gradient descent (SGD) for training. The initial learning rate is set to 2 × 10⁻³ and increases to 2 × 10⁻² by a factor of 10 after 6 epochs. Then, a learning rate of lr × BatchSize/64 (linear scaling) is employed, with a cosine lr schedule. The weight decay is 0.0005, and the SGD momentum is 0.9. The batch size is 128 on 8 V100 GPUs, for a total of 96 epochs. We use the DeiT [34] data augmentation strategy, adopt Mosaic, MixUp, and multiscale training [52] to improve the performance of YOLOX, and turn off data augmentation in the last 15 epochs [2]. The implementation of our network is based on MMDetection [53].
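As an illustration, the first-stage backbone optimization described above (AdamW, initial learning rate 3 × 10⁻⁴, weight decay 0.05, decay by 0.1 at epochs 40 and 52) could be set up as follows; `backbone` is a placeholder for the Swin Deformable Transformer module.

```python
import torch

def build_stage1_optimizer(backbone):
    """AdamW with step decay, matching the first-stage settings described in the text."""
    optimizer = torch.optim.AdamW(backbone.parameters(), lr=3e-4, weight_decay=0.05)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[40, 52], gamma=0.1)
    return optimizer, scheduler
```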

Ablation Experiment.
To study the effect of different components on the performance of our network, we conduct ablation experiments and report the results in Table 1. The AP and AP_S of Swin Transformer-PAFPN-YOLOX are 1.0% and 1.8% higher than those of DarkNet53-PAFPN-YOLOX, respectively, indicating that the Swin Transformer extracts features more effectively than DarkNet53. The AP and AP_S of Swin Deformable Transformer-PAFPN-YOLOX are 0.7% and 1.4% higher than those of Swin Transformer-PAFPN-YOLOX, respectively, indicating that our Swin Deformable Transformer performs better than the Swin Transformer. The final Swin Deformable Transformer-BiPAFPN-YOLOX is 0.6% and 0.5% higher than Swin Deformable Transformer-PAFPN-YOLOX in AP and AP_S, respectively, indicating that using BiPAFPN as YOLOX's neck outperforms PAFPN. Compared with the original network, our network has fewer parameters, the model complexity is reduced by 30.1%, the computational complexity is reduced by 40.3%, and the inference speed is increased by 10.0%.
To investigate the effect of using the Reconstructed Deformable Self-Attention at different stages on the performance of our algorithm, we conduct ablation experiments. The Swin Transformer has a total of four stages, and we use the Reconstructed Deformable Self-Attention in the first block of each stage. On this basis, we use the Reconstructed Deformable Self-Attention in turn in the second block of each stage. The results are shown in Table 2. When the Reconstructed Deformable Self-Attention is used in all blocks of each stage, AP and AP_S are increased by 1.3% and 1.9%, respectively, computational complexity and model complexity are reduced by 47.7% and 44.0%, respectively, and the inference speed is increased by 35.2%. This demonstrates that using the Reconstructed Deformable Self-Attention in all blocks of each stage enables our network to achieve the best results.
We propose two schemes: in scheme (A), we use patch grid points instead of random points; in scheme (B), we let the detection head predict the point box offsets of the patch grid points relative to the center point of the bounding box. To verify the effectiveness of these schemes, we conduct ablation experiments, and the results are shown in Table 3. The AP and AP_S of scheme A increase by 0.5% and 0.7%, respectively, and the AP and AP_S of scheme B increase by 0.6% and 0.9%, respectively. When scheme A and scheme B work together, AP and AP_S are increased by 0.9% and 1.4%, respectively, the model complexity is reduced by 36.1%, the computational complexity is reduced by 29.6%, and the inference speed is increased by 32.3%. This shows that our two proposed schemes can significantly improve the detection precision and inference speed of the network and reduce the model and computational complexities.

Quantitative Analysis.
To verify whether our network can effectively shorten the training period while maintaining high performance, we conduct comparative experiments and show the results in Table 4. The Swin Deformable Transformer-BiPAFPN-YOLOX has 1.3% and 1.9% higher AP and AP_S than the Swin Transformer-BiPAFPN-YOLOX, respectively, while the training epochs are reduced by 55.4%. Simultaneously, the model and computational complexities are reduced by 44.0% and 47.7%, respectively, and the inference speed is increased by 26.0%. This indicates that the Swin Deformable Transformer-BiPAFPN-YOLOX can effectively reduce the training time while maintaining high performance. Figure 10 shows the convergence curves of the Swin Deformable Transformer-BiPAFPN-YOLOX and the Swin Transformer-BiPAFPN-YOLOX, represented by the red and blue lines, respectively. The Swin Deformable Transformer-BiPAFPN-YOLOX reaches the convergence state around 150 epochs, at which point its AP is close to 50%; the Swin Transformer-BiPAFPN-YOLOX converges around 350 epochs, and its AP is significantly lower. The figure also shows that the rise of the red line is smoother than that of the blue line. This indicates that our Swin Deformable Transformer-BiPAFPN-YOLOX has a faster convergence speed and a more stable training process.

Qualitative Analysis.
To compare the performance of our network and the original YOLOX [2] network more comprehensively, we expand the parameter scales of our network according to the same scaling rules as YOLOX [2] and obtain four networks, S, M, L, and X, which increase in size sequentially. Comparative experiments are carried out on the COCO 2017 test set, and the results are shown in Table 5. The AP and AP_S of Swin Deformable Transformer-BiPAFPN-YOLOX are 5.1%, 2.0%, 1.8%, and 0.9% and 3.2%, 2.4%, 1.9%, and 0.7% higher than those of the corresponding scale Darknet53-PAFPN-YOLOX, respectively, while the computational complexity is reduced by 39.2%, 31.4%, 7.1%, and 21.4%, respectively, and the inference speed is increased by 20.0%, 28.4%, 35.8%, and 22.8%, respectively. This indicates that, compared with the original YOLOX [2] network, our network achieves higher precision, lower computational complexity, and faster inference speed. The detection precision for small-scale objects is likewise significantly improved.

Quantitative Analysis.
To evaluate the performance of our networks more objectively, we compare our four parameter-scale networks S, M, L, and X with other networks of corresponding scales. Figure 13 shows that the original network produces some missed objects (false negatives) and falsely detected objects (false positives) when objects occlude each other. Hence, our network achieves higher precision on large-scale objects, indicating that it is more advanced. Class activation mapping (CAM) maps are also referred to as attention maps; they represent the distribution of the contributions to the prediction result. The important regions on the image are marked as highlighted regions.
Brighter regions indicate higher scores, larger attention weights assigned to them, and a more precise prediction result. We adopt Score-CAM [55] to visualize the attention maps of the Swin Transformer-BiPAFPN-YOLOX and the Swin Deformable Transformer-BiPAFPN-YOLOX. Figure 14 shows the attention maps of the four stages of the Swin Deformable Transformer-BiPAFPN-YOLOX and the Swin Transformer-BiPAFPN-YOLOX. In the first stage, compared with the Swin Transformer-BiPAFPN-YOLOX, the attention of the Swin Deformable Transformer-BiPAFPN-YOLOX has been transferred to the important regions. The larger attention weights assigned to these regions yield more precise prediction results.

Conclusion
Object detection technology plays a crucial role in people's everyday lives, enterprise production, and modern national defense. This study explores how to apply an attention-based Transformer more effectively to the object detection task. We propose an attention mechanism based on important regions, named Reconstructed Deformable Self-Attention, which shifts attention to important object regions, ignores unimportant features, and achieves more efficient global modeling. Based on the Reconstructed Deformable Self-Attention, we propose a novel backbone named the Swin Deformable Transformer, which improves the feature extraction ability and convergence speed of the backbone. Based on the Swin Deformable Transformer backbone, we propose a novel object detection network, Swin Deformable Transformer-BiPAFPN-YOLOX. This study thus introduces a Transformer into the object detection networks of the YOLO series. The experimental results show that, compared with previous state-of-the-art methods on the COCO 2017 object detection benchmark, our Swin Deformable Transformer-BiPAFPN-YOLOX significantly boosts the detection precision, inference speed, and convergence speed, especially for small object detection. Furthermore, the detection precision increases as the model complexity increases, whereas the inference speed decreases; thus, high precision and real-time performance cannot be satisfied at the same time. In the future, with the rapid development of computer hardware, we believe that this problem can be efficiently solved. The Swin Deformable Transformer is a multitask backbone. One future research direction is to explore the performance of the Swin Deformable Transformer for segmentation and classification; another is to explore its 3D environment perception performance in multimodal fusion tasks from the perspective of bird's-eye view (BEV).

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.